Read and format project data
# Import the libraries and read the project data
import pandas as pd
import plotly.express as px
df = pd.read_csv("https://github.com/byuidatascience/data4names/raw/master/data-raw/names_year/names_year.csv")

Course DS 250
Josh Schaefer
Highlight the Questions and Tasks
How does your name at your birth year compare to its use historically?
In the line chart, the use of my name in my birth year (2004) is not out of the ordinary. The name “Joshua” was still very common in 2004, though less so than about a decade earlier. The table at the bottom shows 20,838 people named “Joshua” in 2004, compared with an average of 17,624 per year across all years in the dataset. Because the 2004 count is above that average, a Joshua is somewhat more likely than usual to have been born around 2004.
# Filter the data to the name "Joshua"
joshua_data = df.query("name == 'Joshua'")
# Number of Joshuas born in 2004, used to position the annotation
total_joshuas_2004 = joshua_data.query("year == 2004")['Total'].sum()
fig = px.line(joshua_data, x='year', y='Total', title='Occurrences of the name "Joshua" over the years')
# Mark my birth year on the line
fig.add_annotation(
    x=2004,
    y=total_joshuas_2004,
    text='Year I was born (2004)',
    showarrow=True,
    arrowhead=2,
    arrowcolor='red',
    arrowwidth=2,
    ax=20,
    ay=-80,
)
fig.show()
::: {#cell-Q1 chart .cell execution_count=4}
:::
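The comparison between one year's count and the long-run average could be computed along the following lines. This is only a sketch: the `toy` frame and the helper `compare_to_mean` are hypothetical stand-ins, not the cell used in this report, and the numbers are made up rather than taken from `names_year.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the filtered "Joshua" rows.
toy = pd.DataFrame({
    "year": [2002, 2003, 2004, 2005],
    "Total": [100, 200, 400, 300],
})

def compare_to_mean(frame, year):
    """Return (count in `year`, mean count over all years, count / mean)."""
    count = frame.loc[frame["year"] == year, "Total"].sum()
    mean = frame["Total"].mean()
    return count, mean, count / mean

count, mean, ratio = compare_to_mean(toy, 2004)
print(f"{count} in 2004 vs. a mean of {mean:.0f} per year ({ratio:.2f}x)")
```

A ratio above 1 means the name was used more than average in that year, which is the sense in which the 2004 figure is "higher than the average".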
If you talked to someone named Brittany on the phone, what is your guess of his or her age? What ages would you not guess?
From the bar chart I could get a pretty good idea of when someone named Brittany was most likely born. Just from eyeballing it, my guess was an age between 31 and 37. To get a more exact answer, I wrote a short program: I summed the birth years of everyone named Brittany in the dataset and divided by the total number of Brittanys. This gave me the year 1991, and subtracting that from 2024 gives 33. So if I were on the phone, I would guess that someone named Brittany is around 33. I would not guess that they were younger than 20 or older than 45.
In this cell I simply used the query method to pull all of the Brittany data, which I can reuse for my upcoming charts and calculations.
::: {#cell-Q2 chart .cell execution_count=6}
:::
Average year of Brittany: 33
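One way to read the calculation described above is as a count-weighted mean of the birth years. The sketch below shows that interpretation on made-up numbers. The `toy` frame, the helper `weighted_mean_year`, and the reference year 2024 are assumptions for illustration, not the original cell.

```python
import pandas as pd

# Hypothetical toy data standing in for the filtered "Brittany" rows.
toy = pd.DataFrame({
    "year": [1989, 1990, 1991, 1992],
    "Total": [100, 200, 400, 100],
})

def weighted_mean_year(frame):
    """Mean birth year, weighting each year by how many babies got the name."""
    return (frame["year"] * frame["Total"]).sum() / frame["Total"].sum()

mean_year = weighted_mean_year(toy)
# Assumed reference year for turning a birth year into an age.
estimated_age = 2024 - round(mean_year)
print(mean_year, estimated_age)
```

Weighting by `Total` matters here: a plain average of the `year` column would treat a year with a handful of Brittanys the same as the peak year.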
Mary, Martha, Peter, and Paul are all Christian names. From 1920 - 2000, compare the name usage of each of the four names. What trends do you notice?
After creating both the line chart and the table, I can see that the use of these Christian names peaked between the late 1940s and the mid-1950s, around 1952 on average. After that, usage slowly declined until around 1980, by which point these names were not used nearly as much. I find it really interesting that these names fell out of use. I wonder whether Christian names became less popular or the world simply moved on to new names.
# Filter the data to the four names using the query method on the "name" column
names_data = df.query("name in ['Mary', 'Martha', 'Peter', 'Paul']")
# Split out one DataFrame per name for the chart and the table
mary_data = names_data.query("name == 'Mary'")
martha_data = names_data.query("name == 'Martha'")
peter_data = names_data.query("name == 'Peter'")
paul_data = names_data.query("name == 'Paul'")
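The question limits the comparison to 1920 - 2000, so a year filter can be layered on top of the name filter. The following is a minimal sketch on made-up rows; the `toy` frame is hypothetical, and only the column names `name`, `year`, and `Total` are taken from the code in this report.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from names_year.csv.
toy = pd.DataFrame({
    "name":  ["Mary", "Peter", "Paul", "Joshua", "Martha"],
    "year":  [1919, 1920, 2000, 1950, 2001],
    "Total": [10, 20, 30, 40, 50],
})

names = ["Mary", "Martha", "Peter", "Paul"]
# Keep only the four names, and only years from 1920 through 2000 (inclusive).
in_range = toy[toy["name"].isin(names) & toy["year"].between(1920, 2000)]
print(in_range)
```

`Series.between` includes both endpoints by default, so rows for 1920 and 2000 are kept while 1919 and 2001 are dropped.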
::: {#cell-Q3 chart .cell execution_count=9}
My useless chart
:::
::: {#cell-Q3 table .cell .tbl-cap-location-top tbl-cap=‘Not much of a table’ execution_count=10}
import math

# Year in which each name reached its highest count
mary_peak_year = mary_data.loc[mary_data['Total'].idxmax(), 'year']
martha_peak_year = martha_data.loc[martha_data['Total'].idxmax(), 'year']
peter_peak_year = peter_data.loc[peter_data['Total'].idxmax(), 'year']
paul_peak_year = paul_data.loc[paul_data['Total'].idxmax(), 'year']
# One row per name, plus an "Average" row that is filled in below
peak_years_df = pd.DataFrame({
    'Name': ['Mary', 'Martha', 'Peter', 'Paul', 'Average'],
    'Total': [mary_data['Total'].max(), martha_data['Total'].max(), peter_data['Total'].max(), paul_data['Total'].max(), None],
    'Year': [mary_peak_year, martha_peak_year, peter_peak_year, paul_peak_year, None]
})
peak_years_df['Total'] = peak_years_df['Total'].round()
peak_years_df['Year'] = peak_years_df['Year'].round()
# Average the four peak years (excluding the empty last row) and round up
average_peak = peak_years_df['Year'][:-1].mean()
average_peak = math.ceil(average_peak)
peak_years_df.loc[4, 'Year'] = average_peak
display(peak_years_df)

| | Name | Total | Year |
|---|---|---|---|
| 0 | Mary | 53791.0 | 1950.0 |
| 1 | Martha | 10651.0 | 1947.0 |
| 2 | Peter | 11321.0 | 1956.0 |
| 3 | Paul | 25662.0 | 1954.0 |
| 4 | Average | NaN | 1952.0 |
:::
In this table, Total is the count for each name in its peak year, and Year is the year in which the name was used most. The Average row gives the mean of the four peak years, rounded up.
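The same kind of peak-year table can also be built without one variable per name by grouping on the name column. This is an alternative sketch, not the cell used in this report, and the `toy` frame holds made-up counts.

```python
import pandas as pd

# Hypothetical toy data with two names and three years each.
toy = pd.DataFrame({
    "name":  ["Mary", "Mary", "Mary", "Paul", "Paul", "Paul"],
    "year":  [1949, 1950, 1951, 1953, 1954, 1955],
    "Total": [50, 60, 55, 20, 30, 25],
})

# idxmax gives, for each name, the row label where Total is largest.
peak_rows = toy.loc[toy.groupby("name")["Total"].idxmax()]
peaks = peak_rows[["name", "Total", "year"]].reset_index(drop=True)

# Mean of the peak years across the names.
average_peak_year = peaks["year"].mean()
print(peaks)
print(average_peak_year)
```

Grouping keeps the logic the same when names are added or removed, since only the filter on `name` would need to change.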